Evaluating Common De-Identification Heuristics for Personal Health Information
Abstract
BACKGROUND With the growing adoption of electronic medical records, there are increasing demands for the use of this electronic clinical data in observational research. A frequent ethics board requirement for such secondary use of personal health information in observational research is that the data be de-identified. De-identification heuristics are provided in the Health Insurance Portability and Accountability Act Privacy Rule, funding agency and professional association privacy guidelines, and common practice.

OBJECTIVE The aim of the study was to evaluate whether the re-identification risks due to record linkage are sufficiently low when following common de-identification heuristics and whether the risk is stable across sample sizes and data sets.

METHODS Two methods were followed to construct identification data sets. Re-identification attacks were simulated on these. For each data set we varied the sample size down to 30 individuals, and for each sample size evaluated the risk of re-identification for all combinations of quasi-identifiers. The combinations of quasi-identifiers that were low risk more than 50% of the time were considered stable.

RESULTS The identification data sets we were able to construct were the list of all physicians and the list of all lawyers registered in Ontario, using 1% sampling fractions. The quasi-identifiers of region, gender, and year of birth were found to be low risk more than 50% of the time across both data sets. The combination of gender and region was also found to be low risk more than 50% of the time. We were not able to create an identification data set for the whole population.

CONCLUSIONS Existing Canadian federal and provincial privacy laws help explain why it is difficult to create an identification data set for the whole population.
That such examples of high re-identification risk exist for mainstream professions makes a strong case for not disclosing the high-risk variables and their combinations identified here. For professional subpopulations with published membership lists, many variables often needed by researchers would have to be excluded or generalized to ensure consistently low re-identification risk. Data custodians and researchers need to consider other statistical disclosure techniques for protecting privacy.
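The stability check described under METHODS can be illustrated with a minimal sketch. The abstract does not state which risk measure or threshold the study used, so everything below is an assumption made for illustration: risk is taken as the average probability of a correct match (one over the size of each sampled record's equivalence class in the identification data set), the 0.2 threshold is a placeholder, and the names `match_risk` and `stable_combinations` are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def match_risk(sample, identification_set, quasi_ids):
    """Average 1 / (equivalence-class size) over the sampled records.

    Assumed risk measure, not taken from the abstract. The sample must be
    drawn from identification_set so that every key is present.
    """
    def key(record):
        return tuple(record[q] for q in quasi_ids)

    class_sizes = Counter(key(r) for r in identification_set)
    return sum(1.0 / class_sizes[key(r)] for r in sample) / len(sample)


def stable_combinations(identification_set, quasi_ids, sample_size,
                        threshold=0.2, trials=100, seed=0):
    """Flag each quasi-identifier combination that is low risk in more
    than 50% of random samples (threshold is a hypothetical value)."""
    rng = random.Random(seed)
    combos = [c for r in range(1, len(quasi_ids) + 1)
              for c in combinations(quasi_ids, r)]
    low_risk_counts = Counter()
    for _ in range(trials):
        sample = rng.sample(identification_set, sample_size)
        for combo in combos:
            if match_risk(sample, identification_set, combo) <= threshold:
                low_risk_counts[combo] += 1
    return {combo: low_risk_counts[combo] / trials > 0.5 for combo in combos}
```

Calling `stable_combinations(registry, ["gender", "region", "year_of_birth"], sample_size=30)` on a list of dictionaries would return a mapping from each combination to a stable or not-stable flag; repeating the call for several sample sizes mirrors the variation in sample size that the abstract describes.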
Similar resources
Influence of Module Order on Rule-Based De-identification of Personal Names in Electronic Patient Records Written in Swedish
Electronic patient records (EPRs) are a valuable resource for research but for confidentiality reasons they cannot be used freely. In order to make EPRs available to a wider group of researchers, sensitive information such as personal names has to be removed. Deidentification is a process that makes this possible. Both rule-based as well as statistical and machine learning based methods exist t...
Evaluation Measures for Detection of Personal Health Information
Texts containing personal health information reveal enough data for a third party to be able to identify an individual and his health condition. Detection of personal health information in electronic health records is an essential part of record deidentification. Performance evaluation in use today focuses on method’s ability to identify whether a word reveals personal health information or not...
A de-identifier for medical discharge summaries
OBJECTIVE Clinical records contain significant medical information that can be useful to researchers in various disciplines. However, these records also contain personal health information (PHI) whose presence limits the use of the records outside of hospitals. The goal of de-identification is to remove all PHI from clinical records. This is a challenging task because many records contain forei...
Adoption of Electronic Personal Health Records in Canada: Perceptions of Stakeholders
Background Healthcare stakeholders have a great interest in the adoption and use of electronic personal health records (ePHRs) because of the potential benefits associated with them. Little is known, however, about the level of adoption of ePHRs in Canada and there is limited evidence concerning their benefits and implications for the healthcare system. This study aimed to describe the current ...
Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies.
BACKGROUND De-identification and anonymization are strategies that are used to remove patient identifiers in electronic health record data. The use of these strategies in multicenter research studies is paramount in importance, given the need to share electronic health record data across multiple environments and institutions while safeguarding patient privacy. METHODS Systematic literature s...